1 Introduction

This notebook contains all code for parsing and analysis the protein clusters generated using MMSeq2.

2 Method

To explore the degree of sequence divergencen and similarity in each of the clusters created by MMSeq2, all-versus-all BLASTP analysis was performed. In every instance, the script run_blastp.py from the Python package pyrewton DOI:10.5281/zenodo.3876218) was used to run the BLASTP all-versus-all analysis.

The table compiled by BLASTP contains the following columns: - qseqid - sseqid - pident - length - mismatch - gapopen - qstart - qend - sstart - send - evalue - bitscore

In order to compare each of the pair-wise-alignment we need to calculate the Blast Score Ratio (SCR) to normalise for length. This was first presented by Rasko et al., 2005.

The bitscore reported by BLAST is the sum of the qualities of the aligned symbols over the whole alignment. This is an accurate measure of the alignment strength, but long sequences tend to have higher bitscores than short sequences, even when the matches are of about the same quality. To correct for this length effect, we can calculate a normalised bitscore where:

normalised bitscore = bitscore / query length

3 Preliminary comparison between clusters

A representative sequence from each of the 4 largest clusters compiled using MMSeq2 (with a percentage identity and coverage of cut-off of 70%) was extracted from a local CAZyme database using cazy_webscraper.

Each representative sequence was identified by using the GenBank accession assigned as the name of the cluster by MMSeq2.

Table 3.1 presents the raw output from BLASTP as well as the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.

Table 3.1: Output from BLASTP all-vs-all analysis of the representative sequences from the 4 larges protein clusters
qseqid sseqid pident cov qlen slen alen bitscore evalue bsr
CBK69950.1 CBK69950.1 100.000 100 539 539 539 1121 0 2.0797774
CBK69950.1 AGE22437.1 49.171 100 539 567 543 520 0 0.9647495
CBK69950.1 CDG29680.1 49.631 99 539 533 542 520 0 0.9647495
CBK69950.1 QJR11213.1 44.383 99 539 534 543 412 0 0.7643785
QJR11213.1 QJR11213.1 100.000 100 534 534 534 1107 0 2.0730337
QJR11213.1 CDG29680.1 47.532 99 534 533 547 477 0 0.8932584
QJR11213.1 AGE22437.1 49.358 99 534 567 545 473 0 0.8857678
QJR11213.1 CBK69950.1 44.954 99 534 539 545 417 0 0.7808989
CDG29680.1 CDG29680.1 100.000 100 533 533 533 1118 0 2.0975610
CDG29680.1 AGE22437.1 66.417 99 533 567 533 753 0 1.4127580
CDG29680.1 CBK69950.1 49.631 99 533 539 542 520 0 0.9756098
CDG29680.1 QJR11213.1 47.091 99 533 534 550 471 0 0.8836773
AGE22437.1 AGE22437.1 100.000 100 567 567 567 1179 0 2.0793651
AGE22437.1 CDG29680.1 66.417 94 567 533 533 753 0 1.3280423
AGE22437.1 CBK69950.1 49.171 94 567 539 543 520 0 0.9171076
AGE22437.1 QJR11213.1 49.088 93 567 534 548 469 0 0.8271605

Figure 3.1 is an interactive plot presenting the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.

To view the specific BSR for each comparison, hover over the plot and a tooltip will appear and will present the GenBank accessions of the corresponding proteins as well as the specific BSR value (to 3dp).

Figure 3.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

The cluster CBK69950.1 contained 33 protein sequences, QJR11213.1 28 protein sequences, CDG29680.1 17 protein sequences, and AGE22437.1 contaiend 13 protein sequences.

The BSR infer that the two smaller clusters (CDG29680.1 and AGE22437.1) could potentially be combined to create a large cluster. The proteins in this new combined clusters could then be aligned to create an multisequence alignment (MSA) of functionally relevant proteins for molecular modeling. Conversely, the BSR inferred the two larger clusters should be kept separate ( and not combined) to create a high quality MSA (i.e. minimal gaps and readily identifiable consensus sequences) of the proteins within each cluster.

4 Sequence divergence in individual clusters

To explore the sequence divergence in each of the 4 largest clusters created by MMSeq2, for each cluster the protein sequences were retrieved using cazy_webscraper and were written to a FASTA file. The protein sequences were then compared to one another in a BLASTP all-versus-all analysis.

4.1 AGE22437.1

Figure 4.1 presents the heatmap of the all-versus-all BLASTP analysis of the 13 protein sequences in the AGE22437.1 cluster. The sequence BAM46395.1 has the least sequence similarity to all other members of the cluster. Overall, the cluster has a relatively similar sequence as inferred from the BSR.

Figure 4.1: Heatmap of all-versus-all BLASTP of the AGE22437.1 cluster

4.2 CBK6950.1

Figure 4.2 presents the heatmap of the all-versus-all BLASTP analysis of the 33 protein sequences in the CBK6950.1 cluster. Except for 2 instances, every pair-wise alignment produced a BSR of greater than 1.3 with many with a BRS greater than 1.5. This infers a strong sequence similarity across all proteins in the cluster. However,
- CBL10126.1 against VEI47713.1 - CBL13352.1 against VEI47713.1 - VEI47713.1 against CBL10126.1 - VEI47713.1 against CBL13352.1 produced BRS of 0.04.

Figure 4.2: Heatmap of all-versus-all BLASTP of the CBK6950.1 cluster

4.3 CDG29680.1

Figure 4.3 presents the heatmap of the all-versus-all BLASTP analysis of the 17 protein sequences in the CDG29680.1 cluster. As with CBK6950.1, overall all proteins in the cluster have a similar protein sequence to one another as inferred from the high (greater than 1.4) BSR. However, there are a few instances of pairwise alignments with BSRs of less than 0.5.

Figure 4.3: Heatmap of all-versus-all BLASTP of the CDG29680.1 cluster

4.4 QJR11213.1

Figure 4.4 presents the heatmap of the all-versus-all BLASTP analysis of the 28 protein sequences in the QJR11213.1 cluster. The heatmap shows clusters of proteins with high sequence similarity, as inferred from the high (greater than 1.8) BSR. AK102785.1, QJR11213.1 and QJR15254.1 both relatively lower BSRs of approximately 1.4 against all other proteins in the cluster. However, a BSR of 1.4 is still a relatively high BSR, therefore, there is relatively little protein sequence variation in this cluster.

Figure 4.4: Heatmap of all-versus-all BLASTP of the QJR11213.1 cluster

5 Combining clusters

5.1 AGE22437.1 and CDG29680.1

Figure 4.1 inferred that the clusters AGE22437.1 and CDG29680.1 could potentially be combined. To further explore this the a BLASTP all-versus-all analysis of all proteins in both clusters was performed. The BSRs from this analysis are presented in figure @ref(fig:AGE22437CDG29680}, and the plot demonstrates that the vast majoirty of pair-wise alignements produced a BSR of greater than 1.4. This infers relatively sequence divergence between all proteins across the two clusters, and therefore, the two clusters can be combined.

Figure 5.1: Heatmap of all-versus-all BLASTP of the proteins from both AGE22437.1 and CDG29680.1 clusters

5.2 Sequence divergence across all 4 clusters

To explore the possibility of creating a MSA with a consensus across the majority of proteins when aligning proteins from all 4 clusters (AGE22437.1, CBK6950.1, CDG29680.1 and QJR11213.1), a all-versus-all BLASTP analysis of all proteins in all 4 clusters was performed. The BSR of all pairwise alignments are presented in figure ??. Clear clusters of proteins with high BSRs (greater than 1.5) were observed, and these clusters included proteins from across multiple clusters. The majority of pairwise alignments scores a BSR of 1 and greater. Consequently, it was hypothesised that all 4 clusters could be combined to generate a MSA of functionally relevant proteins for molecular modeling.

Figure 5.2: Heatmap of all-versus-all BLASTP of the proteins from both AGE22437.1, CBK6950.1, CDG29680.1 and QJR11213.1 clusters